[Kubernetes leader election] Run leader elector at all times #4542

constanca-m · 2024-04-08T13:32:45Z

What does this PR do?

This PR covers the issue elastic/beats#38543.

The fix was already merged for beats: elastic/beats#38471.

Now we need to align the agent with the same logic as well.

We do that by making sure the leader elector is always running. Previously, when an instance lost the lease, the leader elector would finish running, and it would not start again.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool
I have added an integration test or an E2E test

Related issues

How to test

Build elastic-agent:

EXTERNAL=true SNAPSHOT=true DEV=true PLATFORMS=linux/amd64 PACKAGES=docker mage package

Create stack and kind cluster, and connect the cluster to the stack:

elastic-package stack up --version=8.14.0-SNAPSHOT
kind create cluster
docker network connect elastic-package-stack_default kind-control-plane

Build docker image:

cd build/package/elastic-agent/elastic-agent-linux-amd64.docker/docker-build
docker build -t custom-agent-image .

Load this image in your kind cluster:

kind load docker-image custom-agent-image:latest

Deploy agent with that image:

containers:
  - name: elastic-agent
    image: custom-agent-image:latest
    imagePullPolicy: Never

Results

The following test was done to make sure the leader elector run again after losing the lease:

Create 2 node cluster, so there are two agent running:
Change the lease, so that the first leader loses it.
First leader:

c@c:~$ kubectl get leases -n kube-system | grep elastic-*
elastic-agent-cluster-leader           elastic-agent-leader-elastic-agent-standalone-kmx9m                         9m24s

Change the lease holder. You can do it like this.

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: elastic-agent-cluster-leader
  namespace: kube-system
spec:
  holderIdentity: change

After a while, check the new leader:

c@c:~$ kubectl get leases -n kube-system | grep elastic-*
elastic-agent-cluster-leader           elastic-agent-leader-elastic-agent-standalone-67x8n                         11m

Change the lease again, and make sure the leader gets repeated. This way we know that the leader elector kept running.

c@c:~$ kubectl get leases -n kube-system | grep elastic-*
elastic-agent-cluster-leader           elastic-agent-leader-elastic-agent-standalone-kmx9m                         14m

As we can see, the leader repeated.

mergify · 2024-04-08T13:33:31Z

This pull request does not have a backport label. Could you fix it @constanca-m? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

changelog/fragments/1712583231-leader-election-issue.yaml

internal/pkg/composable/providers/kubernetesleaderelection/kubernetes_leaderelection.go

Co-authored-by: Blake Rouse <blake.rouse@elastic.co>

…ernetes_leaderelection.go Co-authored-by: Blake Rouse <blake.rouse@elastic.co>

cmacknz

Requesting changes until we get an answer on if this needs rate limiting and the impacts on the k8s control plane traffic.

The current implementation definitely needs to change, but it's not clear to me that this change won't just make rapid calls to acquire leases as fast as possible yet.

cmacknz · 2024-04-08T14:44:41Z

internal/pkg/composable/providers/kubernetesleaderelection/kubernetes_leaderelection.go

-	return comm.Err()
+
+        for {
+		le.Run(ctx)


If there is a cluster error resulting in the leader continuously losing the lease, will this result in attempts to acquire it as quickly as possible with no rate limit?

The implementation of Run I see is:

// Run starts the leader election loop. Run will not return // before leader election loop is stopped by ctx or it has // stopped holding the leader lease func (le *LeaderElector) Run(ctx context.Context) { defer runtime.HandleCrash() defer func() { le.config.Callbacks.OnStoppedLeading() }() if !le.acquire(ctx) { return // ctx signalled done } ctx, cancel := context.WithCancel(ctx) defer cancel() go le.config.Callbacks.OnStartedLeading(ctx) le.renew(ctx) }

There have been several escalations showing that failing to appropriately rate limit k8s control plane API calls like leader election can destabilize clusters.

What testing have we done to ensure this change won't cause issues like this?

We have already been doing this. This change will not affect the number of calls, it just makes sure that at least 1 agent will be reporting metrics. The problem with the implementation now is that run goes like this:

Keeps trying to acquire the lease

Acquire the lease

Loose the lease and stop running
All the while, all the other instances were trying to acquire the lease already.

We do not have many SDHs on this bug, which leads me to believe that it is rare for an agent to lose the lease. But we do have problems when agents stop reporting metrics, and the only way we knew how to restart that was to make the pod run again - that is, force run() to run again.

There are some parameters in the config regarding the lease:

elastic-agent/internal/pkg/composable/providers/kubernetesleaderelection/config.go

Lines 17 to 19 in 43cb148

LeaseDuration int `config:"leader_leaseduration"`

RenewDeadline int `config:"leader_renewdeadline"`

RetryPeriod int `config:"leader_retryperiod"`

If it becomes necessary to reduce the amount of times an agent tries to acquire it.

cmacknz · 2024-04-08T15:20:57Z

Before I remove my requested changes, do we have a way to test this in this repository?

It looks like you'd need a test that you can get the lease again after you've lost it.

I see it is possible to inject the client used in the lock implementation, which takes an interface:

elastic-agent/internal/pkg/composable/providers/kubernetesleaderelection/kubernetes_leaderelection.go

Line 80 in 09b7410

Client: client.CoordinationV1(),

I also see a fake client is used in the secrets tests:

elastic-agent/internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets_test.go

Line 36 in 09b7410

client := k8sfake.NewSimpleClientset()

Is is possible to mock out the lease client to the test this?

constanca-m · 2024-04-08T15:27:05Z

Before I remove my requested changes, do we have a way to test this in this repository?

For the beats PR, I did add an unit test to check that. However, we had a way to change the start leading and stop function there, and this repository I cannot get inside those functions to check if that is working. Because of that, I don't think it is possible to test.

There is a way to force the lease to be updated, and sometimes the leader changes. The problem is that we are working with seconds in the lease duration fields, so if I push a test for that, it might take 2min, is that a good idea? I cannot do less than seconds for those fields unlike in the beats PR, because these fields are integers in seconds, and changing them to time.duration means a breaking change. @cmacknz

cmacknz · 2024-04-08T21:04:59Z

For the beats PR, I did add an unit test to check that. However, we had a way to change the start leading and stop function there, and this repository I cannot get inside those functions to check if that is working. Because of that, I don't think it is possible to test.

Can't you allow access to the lease callbacks you need through an internal constructor, so that the ContextProviderBuilder used externally remains unmodified?

There is a way to force the lease to be updated, and sometimes the leader changes. The problem is that we are working with seconds in the lease duration fields, so if I push a test for that, it might take 2min, is that a good idea? I cannot do less than seconds for those fields unlike in the beats PR, because these fields are integers in seconds, and changing them to time.duration means a breaking change. @cmacknz

As above you could bypass the Config object in ContextProviderBuilder(logger *logger.Logger, c *config.Config, managed bool) using an internal constructor for tests that gives you access to the underlying time.Duration fields.

It seems possible to test this here, although it does look like it'll be more work than it was to write than it was in Beats.

The upstream tests are also an interesting source of inspiration https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection_test.go, but they test at a lower level and use a lot more test machinery from their own packages.

blakerouse

You made the requested changes I asked for and those changes look good.

I agree with Craig on all points for adding unit tests to cover this code

constanca-m · 2024-04-09T08:31:39Z

I added a unit test @blakerouse @cmacknz. The test does the following:

// TestNewLeaderElectionManager will test the leader elector.
// We will try to check if an instance can acquire the lease more than one time. This way, we will know that
// the leader elector starts running again after it has stopped - which happens once a leader looses the lease.
// To make sure that happens we will do the following:
// 1. We will create the lease to be used by the leader elector.
// 2. We will create two context providers - in the default context, this would mean two nodes, each one with an agent running.
// We will wait for one of the agents, agent1, to acquire the lease, before starting the other.
// 3. We force the lease to be acquired by the other agent, agent2.
// 4. We will force the lease to be acquired by the agent1 again. To avoid the agent2 reacquiring it multiple times,
// we will stop this provider and make sure the agent1 can reacquire it.

constanca-m · 2024-04-09T08:34:52Z